STAT 679: Problem Set #3

Q1. Glacial Lakes

The data at this link contain labels of glacial lakes in the Hindu Kush Himalaya, created during an ecological survey in 2015 by the International Centre for Integrated Mountain Development.

Part (a)

a.1 How many lakes are in this dataset?

library(sf)         # for read_sf()
library(tidyverse)
lakes = read_sf("../data/GL_3basins_2015.geojson")
num_lakes = length(unique(lakes$GL_ID))  # each lake has a unique GL_ID

There are 3624 lakes in this data set.

a.2 What are the latitude / longitude coordinates of the largest lakes in each Sub-basin?

largest_lakes = lakes %>%
  group_by(Sub_Basin) %>%
  filter(Area == max(Area)) %>%
  select(GL_ID, Sub_Basin, Area, Latitude, Longitude)
GL_ID            Sub_Basin      Area       Latitude  Longitude  geometry
GL086304E28374N  Arun           4.0192059  28.37403  86.30475   POLYGON ((86.29541 28.35189…
GL085838E28322N  Sun Koshi      5.4113912  28.32223  85.83813   POLYGON ((85.84835 28.32543…
GL086925E27898N  Dudh Koshi     1.3428310  27.89853  86.92510   POLYGON ((86.9127 27.89883,…
GL087866E27869N  Tamor          0.6961043  27.86953  87.86618   POLYGON ((87.85986 27.86149…
GL086447E27946N  Tama Koshi     1.7197430  27.94679  86.44713   POLYGON ((86.43581 27.93609…
GL085717E28042N  Indrawati      0.0269401  28.04209  85.71757   POLYGON ((85.71581 28.04205…
GL086542E27713N  Likhu          0.0903048  27.71321  86.54286   POLYGON ((86.54486 27.71284…
GL082948E29196N  Bheri          4.9235690  29.19634  82.94822   POLYGON ((82.94605 29.21763…
GL082423E29384N  Tila           0.4322915  29.38408  82.42375   POLYGON ((82.42288 29.38101…
GL081554E29648N  Karnali        0.1262245  29.64815  81.55488   POLYGON ((81.55656 29.64652…
GL082414E29753N  Mugu           0.4283468  29.75376  82.41434   POLYGON ((82.41277 29.75636…
GL081526E29772N  West Seti      0.5075936  29.77282  81.52618   POLYGON ((81.52834 29.77348…
GL081577E29897N  Kawari         0.2839747  29.89755  81.57738   POLYGON ((81.57937 29.89932…
GL081780E30128N  Humla          0.7565747  30.12892  81.78076   POLYGON ((81.77613 30.13338…
GL080178E30564N  Kali           0.2396768  30.56462  80.17877   POLYGON ((80.18417 30.56559…
GL084116E28446N  Seti           0.1024868  28.44681  84.11694   POLYGON ((84.11637 28.44483…
GL085519E28467N  Trishuli       0.4539165  28.46783  85.51938   POLYGON ((85.51812 28.47, 8…
GL084628E28596N  Budhi Gandaki  0.2674945  28.59618  84.62826   POLYGON ((84.63338 28.59795…
GL083851E28690N  Marsyangdi     3.3841276  28.69089  83.85184   POLYGON ((83.84369 28.70758…
GL083701E29218N  Kali Gandaki   0.4344556  29.21850  83.70156   POLYGON ((83.7026 29.21713,…
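
As a design note, the same grouped maximum could also be written with dplyr's slice_max(). The sketch below is only an equivalent alternative, assuming the lakes object from a.1; largest_lakes_alt is a hypothetical name. Because with_ties = TRUE is the default, it keeps ties for the largest area, matching the filter(Area == max(Area)) version above.

# Equivalent grouped-maximum query using slice_max()
largest_lakes_alt = lakes %>%
  group_by(Sub_Basin) %>%
  slice_max(Area, n = 1) %>%
  select(GL_ID, Sub_Basin, Area, Latitude, Longitude)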

Part (b)

Plot the polygons associated with each of the lakes identified in part (a).

Hint: You may find it useful to split lakes across panels using the tm_facets function. If you use this approach, make sure to include a scale bar with tm_scale_bar(), so that it is clear that the lakes have different sizes.

# plot_font and page_bg_color are theme constants assumed to be defined in an
# earlier setup chunk of this report
tm_shape(largest_lakes) +
  tm_polygons(col = 'Area', palette = "Blues", legend.show = FALSE) +
  tm_facets(by = 'Sub_Basin', free.scales = FALSE, ncol = 5, scale.factor = 4) +
  tm_scale_bar(width = 0.5) +
  tm_layout(
    main.title = "Largest lakes in each Sub-basin", main.title.position = c('center', 'top'), main.title.fontfamily = plot_font,
    panel.label.bg.color = "darkgoldenrod4", panel.label.fontface = 'bold', panel.label.color = page_bg_color, panel.label.fontfamily = plot_font,
    bg.color = "darkseagreen", outer.bg.color = page_bg_color
  )

Part (c)

Visualize all lakes with latitude between 28.2 and 28.4 and with longitude between 85.8 and 86. Optionally, add a basemap associated with each lake.

lakes_subset = lakes %>%
  filter(Latitude >= 28.2, Latitude <= 28.4, Longitude >= 85.8, Longitude <= 86.0)

# cc_location() comes from the ceramic package and requires a Mapbox API key
basemap = cc_location(loc = c(85.9, 28.3), buffer = 1e4)

tm_shape(basemap) +
  tm_rgb(alpha = 0.9) +
  tm_shape(lakes_subset) +
  tm_polygons(col = 'deepskyblue3') +
  tm_layout(bg.color = page_bg_color, inner.margins = c(0, 0, 0, 0), frame = FALSE, saturation = -1)

Q2. Australian Pharmaceuticals II

The PBS data set contains the number of orders filled every month for different classes of pharmaceutical drugs, as tracked by the Australian Pharmaceutical Benefits Scheme. The code below takes the full PBS data set and filters down to the 10 most commonly prescribed pharmaceutical types. This problem will ask you to implement and compare two approaches to visualizing this data set.

pbs_full <- read_csv("../data/PBS_random.csv") %>% 
  mutate(Month = as.Date(Month))

top_atcs <- pbs_full %>%
  group_by(ATC2_desc) %>%
  summarise(total = sum(Scripts)) %>%
  slice_max(total, n = 10) %>%
  pull(ATC2_desc)

pbs <- pbs_full %>%
  filter(ATC2_desc %in% top_atcs, Month > "2007-01-01")

Part (a)

Implement a stacked area visualization of these data.

ggplot(pbs) +
  geom_area(aes(Month, Scripts / 1e6, col = ATC2_desc, fill = ATC2_desc), alpha = 0.6) +
  scale_x_date(expand = c(0, 0)) +
  scale_y_continuous(expand = c(0, 0), n.breaks = 11) +
  scale_fill_brewer(palette = "Paired") +
  scale_colour_brewer(palette = "Paired") +
  labs(
    title = "Sales of the 10 most commonly prescribed drug classes, 2007 onward",
    y = "Orders filled (in millions)", fill = "CLASS OF DRUG", col = "CLASS OF DRUG"
  )

Part (b)

Implement an alluvial visualization of these data.

# geom_alluvium() comes from the ggalluvial package
ggplot(pbs) +
  geom_alluvium(
    aes(Month, Scripts / 1e6, fill = ATC2_desc, col = ATC2_desc, alluvium = ATC2_desc),
    decreasing = FALSE, alpha = 0.6
  ) +
  scale_x_date(expand = c(0, 0)) +
  scale_y_continuous(expand = c(0, 0), n.breaks = 11) +
  scale_fill_brewer(palette = "Paired") +
  scale_colour_brewer(palette = "Paired") +
  labs(
    title = "Sales of the 10 most commonly prescribed drug classes, 2007 onward",
    y = "Orders filled (in millions)", fill = "CLASS OF DRUG", col = "CLASS OF DRUG"
  )

Part (c)

Compare and contrast the strengths and weaknesses of these two visualization strategies. Which user queries are easier to answer using one approach vs. the other?

Visualization Strategy: Stacked Area
Strengths: Stacking the areas makes it easy to see how the total at any given time point breaks down across drug classes. User queries about which drugs had the highest sales at a time point, and about the overall trend, are easy to answer.
Weaknesses: It is not easy to rank the drug classes by their relative sales.

Visualization Strategy: Alluvial
Strengths: Ranking is easy to identify, since at any given time point the streams are ordered by each class's sales contribution. User queries about which drug has the highest or lowest sales, or how sales change across time points, can be easily answered.
Weaknesses: It is not easy to gauge the total sales at a single glance, since the plot works better as a comparison visualization.

Q3. Spotify Time Series II

The code below provides the number of Spotify streams for the 40 tracks with the highest stream count in the Spotify 100 dataset for 2017. This problem will ask you to explore a few different strategies that can be used to visualize this time series collection.

spotify_full <- read_csv("../data/spotify.csv")

top_songs <- spotify_full %>% 
  group_by(track_name) %>%
  summarise(total = sum(streams)) %>%
  slice_max(total, n = 40) %>%
  pull(track_name)

top_10_songs <- spotify_full %>% 
  group_by(track_name) %>%
  summarise(total = sum(streams)) %>%
  slice_max(total, n = 10) %>%
  pull(track_name)

spotify <- spotify_full %>% filter(track_name %in% top_songs)

# Data preprocessing: shorten one long track name (matched by position, which is
# fragile if the row order changes) and strip parenthetical suffixes from all names
spotify$track_name[spotify$track_name == unique(spotify$track_name)[14]] <- "I Don’t Wanna Live Forever"
spotify$track_name = trimws(str_replace(spotify$track_name, " \\(([^)]+)\\)", ""))

Part (a)

Design and implement a line-based visualization of these data.

# TODO: Save plots as images and load them in the Knit output
ggplot(spotify) +
  geom_line(aes(date, streams / 1e6, group = track_name), col = "forestgreen", alpha = 0.5, size = 0.6) +
  scale_x_date(expand = c(0, 0)) +
  scale_y_continuous(expand = c(0, 0), breaks = seq(0, 12, 1)) +
  labs(title = "Spotify's top 40 streamed songs in 2017", x = "", y = "Streams (in millions)") +
  theme(axis.ticks.x = element_line(), axis.text.x = element_text(size = 7, hjust = 0.7), panel.grid.minor = element_blank())

Part (b)

Design and implement a horizon plot visualization of these data.

# Custom grey/green palette for the horizon fill: one colour per cutpoint interval
custom_greys_palette = colorRampPalette(c("black", "#EAEAEA"), space = "Lab")
custom_greens_palette = colorRampPalette(c("#BAE4B3", "forestgreen"), space = "Lab")
spotify_palette = c(custom_greys_palette(8), custom_greens_palette(4))

# TODO: Save plots as images and load them in the Knit output
# geom_horizon() comes from the ggHoriPlot package
ggplot(spotify, aes(date, streams / 1e6)) +
  geom_horizon(aes(fill = ..Cutpoints..), origin = 4, horizonscale = seq(0, 12, 1), alpha = 0.8) +
  facet_wrap(~ reorder(track_name, -streams), ncol = 1, strip.position = 'left') +
  scale_x_date(expand = c(0, 0)) +
  scale_y_continuous(expand = c(0, 0), breaks = seq(0, 12, 3)) +
  scale_fill_manual(values = spotify_palette) +
  labs(title = "Spotify's top 40 streamed songs in 2017", x = "", y = "", fill = "Streams (in millions)") +
  guides(fill = guide_legend(nrow = 1, reverse = TRUE)) +
  theme(
    axis.ticks.x = element_line(),
    axis.text.x = element_text(hjust = 0.9),
    axis.text.y = element_blank(),
    strip.text.y.left = element_text(angle = 0, hjust = 0.5),
    legend.position = "top")

Part (c)

Building from the static views from (a - b), propose, but do not implement, a visualization of this data set that makes use of dynamic queries. What would be the structure of interaction, and how would the display update when the user provides a cue? Explain how you would use one type of D3 selection or mouse event to implement this hypothetical interactivity.

Response:

I have two ideas for dynamic queries for these temporal visualizations.

  1. Adding a brush interaction inside a facet. When the user brushes over a region inside one facet, that time period would be highlighted (and probably zoomed in on) across all the song facets. This would support a microscopic analysis of the song streams within a small window of time. I would also add a “Reset Zoom” action button to restore the original view. In D3, this could be implemented with d3.brushX(): an “end” event listener would read the brushed extent, update the x-scale domain, and re-render the line paths bound to each facet's selection. A non-graphical alternative to this query is to replace the brush with a date range slider; a rough sketch of this idea appears after this list.
  2. A multi-select tool to filter artists or tracks from the dataset. Facets would appear or disappear based on the chosen filter.

These interactive queries could also be combined to produce a highly responsive and information-dense visualization.
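
Since the assignment asks for a proposal rather than an implementation, the following is only an illustrative sketch of idea 1, written in R with Shiny rather than D3 so the interaction structure is concrete. brushOpts() and reactiveVal() are standard Shiny; the spotify data frame is the one prepared above, while names such as zoom_range and plot_brush are hypothetical.

library(shiny)
library(dplyr)
library(ggplot2)

# Sketch of idea 1: an x-axis brush sets a shared date window across all
# facets, and a "Reset Zoom" button restores the full range.
ui <- fluidPage(
  plotOutput("lines", brush = brushOpts(id = "plot_brush", direction = "x")),
  actionButton("reset", "Reset Zoom")
)

server <- function(input, output, session) {
  zoom_range <- reactiveVal(range(spotify$date))  # start at the full range

  observeEvent(input$plot_brush, {
    # on a Date axis, xmin / xmax arrive as numeric days since the epoch
    zoom_range(as.Date(c(input$plot_brush$xmin, input$plot_brush$xmax),
                       origin = "1970-01-01"))
  })
  observeEvent(input$reset, zoom_range(range(spotify$date)))

  output$lines <- renderPlot({
    spotify %>%
      filter(date >= zoom_range()[1], date <= zoom_range()[2]) %>%
      ggplot() +
      geom_line(aes(date, streams / 1e6, group = track_name)) +
      facet_wrap(~track_name)
  })
}

shinyApp(ui, server)

The key design point carries over to D3: the brush writes to one piece of shared state (the date window), and every facet reads from it, so all panels stay linked.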

Q4. CalFresh Enrollment II

In this problem, we will develop an interactively linked spatial and temporal visualization of enrollment data from CalFresh, a nutritional assistance program in California. We will use D3 in this problem.

Part (a)

Using a line generator, create a static line plot that visualizes change in enrollment for every county in California.